-
Notifications
You must be signed in to change notification settings - Fork 11
Add support for jsonl.zst files #187
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
vinci1it2000
wants to merge
2
commits into
rtuszik:main
Choose a base branch
from
vinci1it2000:support-jsonl.zst
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Contributor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes - here's some feedback:
- The new download_md5 logic drops the previous region‐based MD5 URL derivation—please verify that custom or region-specific tar.bz2 URLs still resolve their checksums correctly.
- Using
endswithon the download URL to detect file types can be brittle (e.g., uppercase extensions or query params); consider parsing the path or MIME type for more robust format detection. - You might extract the shell‐command construction for different formats into a small helper or strategy class to simplify adding future formats and make the code more maintainable.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new download_md5 logic drops the previous region‐based MD5 URL derivation—please verify that custom or region-specific tar.bz2 URLs still resolve their checksums correctly.
- Using `endswith` on the download URL to detect file types can be brittle (e.g., uppercase extensions or query params); consider parsing the path or MIME type for more robust format detection.
- You might extract the shell‐command construction for different formats into a small helper or strategy class to simplify adding future formats and make the code more maintainable.
## Individual Comments
### Comment 1
<location> `src/downloader.py:284-291` </location>
<code_context>
def download_index() -> str:
- output_file = "photon-db-latest.tar.bz2"
download_url = get_download_url()
+ if download_url.endswith(".jsonl.zst"):
+ output_file = "photon-data-dump.jsonl.zst"
</code_context>
<issue_to_address>
**suggestion:** The new download_md5 logic assumes all download URLs can be suffixed with '.md5'.
If the server does not support '.md5' suffixes for all URLs, this logic may fail. Consider adding error handling or validating the URL format before appending the suffix.
Suggested implementation:
```python
def download_md5(download_url: str) -> str:
"""
Attempts to download the MD5 file for the given download_url.
Handles cases where the .md5 file is not available.
Returns the path to the downloaded MD5 file, or raises an exception if not found.
"""
import requests
md5_url = download_url + ".md5"
try:
response = requests.get(md5_url, timeout=10)
response.raise_for_status()
md5_file = os.path.join(config.TEMP_DIR, os.path.basename(md5_url))
with open(md5_file, "wb") as f:
f.write(response.content)
return md5_file
except requests.exceptions.HTTPError as e:
if response.status_code == 404:
raise FileNotFoundError(f"MD5 file not found for URL: {md5_url}")
else:
raise
except Exception as e:
raise RuntimeError(f"Failed to download MD5 file: {e}")
```
You will need to update any calls to `download_md5()` to pass the `download_url` as an argument.
If you want to validate the URL format before appending `.md5`, you can add a check to ensure the URL is a valid file download URL (e.g., ends with a known extension).
</issue_to_address>
### Comment 2
<location> `src/filesystem.py:24-27` </location>
<code_context>
if not os.path.exists(config.TEMP_DIR):
logging.debug(f"Creating temp directory: {config.TEMP_DIR}")
os.makedirs(config.TEMP_DIR, exist_ok=True)
+ if index_file.endswith(".jsonl.zst"):
+ install_command = f"zstd --stdout -d {index_file} | java -jar /photon/photon.jar -nominatim-import -import-file - -data-dir {config.TEMP_DIR} {config.BUILD_PHOTON_PARAMS}"
+ else:
+ install_command = f"lbzip2 -d -c {index_file} | tar x -C {config.TEMP_DIR}"
</code_context>
<issue_to_address>
**suggestion:** Extraction logic for '.jsonl.zst' files assumes zstd and Java are available and compatible.
Consider checking for zstd and Java availability at startup, and enhance error messages to guide users on resolving missing or incompatible dependencies.
```suggestion
def check_executable_available(executable_name):
from shutil import which
if which(executable_name) is None:
raise RuntimeError(
f"Required executable '{executable_name}' not found in PATH. "
f"Please install '{executable_name}' and ensure it is available in your system PATH."
)
if index_file.endswith(".jsonl.zst"):
check_executable_available("zstd")
check_executable_available("java")
install_command = f"zstd --stdout -d {index_file} | java -jar /photon/photon.jar -nominatim-import -import-file - -data-dir {config.TEMP_DIR} {config.BUILD_PHOTON_PARAMS}"
else:
install_command = f"lbzip2 -d -c {index_file} | tar x -C {config.TEMP_DIR}"
```
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
Owner
|
Thanks for this PR! Before having done proper testing on this, some questions:
|
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description
This PR adds support for
.jsonl.zst(JSON Lines with Zstandard compression) file format as an alternative to the existing.tar.bz2format for Photon index data.Changes
zstdpackage to Docker image for decompression support.jsonl.zstfiles using Photon's nominatim-import featureBUILD_PHOTON_PARAMSenvironment variable to configure import parameters (defaults to-languages en,de,fr,es,it).tar.bz2and.jsonl.zstformatsBenefits
.tar.bz2formatTesting
.jsonl.zstfile download and extraction.tar.bz2functionality remains intactBUILD_PHOTON_PARAMSconfiguration works as expectedFixes #122